fix(spanner): fix negative values for max_in_use_sessions metrics #10449

Merged: rahul2393 merged 6 commits into main from fix_negative_metrics on Jun 28, 2024

Conversation

rahul2393
Contributor

Internal bug: b/343756862

rahul2393 requested review from a team as code owners on June 27, 2024 08:12
product-auto-label bot added the "api: spanner" label (Issues related to the Spanner API) on Jun 27, 2024
Comment on lines 1068 to 1069
if (sp.idleList.Len() +
int(sp.createReqs)) != 1 {
Contributor

nit: Can this be on one line? I found it hard to read in its current form.

sh.recycle()
}

for true {
Contributor

nit: Can we add an escape hatch to the loop (e.g. stop after X time)? As written, a future bug could cause this to loop forever, which is harder to debug than a test failure after, for example, 2 seconds.

-// Decrease the number of sessions in use.
-p.decNumInUseLocked(ctx)
+// Decrease the number of sessions in use, only when not from idle list.
+if !isExpire {
Contributor

I think that it would be better to add an additional argument to the remove(..) method that explicitly says whether the session was in use or not. So something like this:

func (p *sessionPool) remove(s *session, isExpire bool, wasInUse bool) {
  ...
  if wasInUse {
    p.decNumInUseLocked(ctx)
  }
}

In theory, this method could be called in the future to remove sessions that have not expired but that were also not in use at the time they were removed, and then we could reintroduce a bug similar to the one here. That is less likely with an explicit argument that clearly says what it is for.
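
To make the suggestion concrete, here is a self-contained sketch. The `session` and `sessionPool` types below are simplified stand-ins invented for this example (the real pool has locking, a context, and more state); only the shape of `remove` follows the signature proposed above.

```go
package main

import "fmt"

// session and sessionPool are simplified stand-ins for illustration only.
type session struct {
	id string
}

type sessionPool struct {
	idle     map[string]*session // sessions currently sitting in the pool
	numInUse int                 // sessions currently checked out
}

// remove drops s from the pool. The in-use counter is decremented only when
// the caller states explicitly that the session was checked out, so the
// decision no longer depends on isExpire.
func (p *sessionPool) remove(s *session, isExpire bool, wasInUse bool) {
	delete(p.idle, s.id)
	if wasInUse {
		p.numInUse--
	}
}

func main() {
	idle := &session{id: "idle"}
	busy := &session{id: "busy"}
	p := &sessionPool{idle: map[string]*session{idle.id: idle}, numInUse: 1}

	// An idle session that is removed without having expired: no decrement.
	p.remove(idle, false, false)
	fmt.Println(p.numInUse) // 1

	// A checked-out session that is removed: the counter goes down.
	p.remove(busy, false, true)
	fmt.Println(p.numInUse) // 0
}
```

The first call in `main` is the case described above: a session that has not expired and was not in use. With only `isExpire` to go on, that call would have decremented the counter.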

rahul2393 force-pushed the fix_negative_metrics branch from 208c629 to 53f7f2c on June 27, 2024 10:03
rahul2393 force-pushed the fix_negative_metrics branch from 53f7f2c to 78e81d5 on June 27, 2024 10:04
rahul2393 requested a review from olavloite on June 27, 2024 10:04
@@ -907,7 +907,7 @@ func (p *sessionPool) close(ctx context.Context) {

 func deleteSession(ctx context.Context, s *session, wg *sync.WaitGroup) {
 	defer wg.Done()
-	s.destroyWithContext(ctx, false)
+	s.destroyWithContext(ctx, false, true)
Contributor

Is deleteSession only called for sessions that are in use at that moment?

Note that inUse means that the session was checked out of the pool at the moment that this method is being called. So in this case it would mean that we are calling deleteSession(..) for a session that was checked out.

Contributor Author

rahul2393, Jun 27, 2024

Renamed the function to closeSession as it will only be called when doing application cleanup.

Contributor

But won't that mean that the metric drops to a negative value for a (very) short time when the application is shutting down? Assume the following situation:

  1. The pool has 100 sessions.
  2. 10 of them are in use.
  3. The application shuts down and closes the client.
  4. The client closes all sessions and calls this method for all 100 sessions.
  5. The inUseSessions metric drops from 10 to -90 for a very short time.
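
The arithmetic in the steps above can be checked with a small simulation. This is only a sketch: `closeAll` is a hypothetical function that models the counter, not the client's shutdown code, and it assumes the same constant `wasInUse` value is passed for every session, as in the diff under review.

```go
package main

import "fmt"

// closeAll models closing every session in the pool while passing the same
// wasInUse value for each one. It returns the in-use counter afterwards.
// Hypothetical model for illustration; not the client's shutdown code.
func closeAll(total, inUse int, wasInUse bool) int {
	counter := inUse
	for i := 0; i < total; i++ {
		if wasInUse {
			counter-- // decremented even for sessions that were idle
		}
	}
	return counter
}

func main() {
	// 100 sessions, 10 in use, every close reports wasInUse = true.
	fmt.Println(closeAll(100, 10, true)) // -90

	// Same pool, every close reports wasInUse = false: the counter is untouched.
	fmt.Println(closeAll(100, 10, false)) // 10
}
```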

Contributor Author

rahul2393, Jun 28, 2024

wasInUse was set to false in the latest commit, so negative values should not appear, and graphs will just show the last exported value.

@@ -197,7 +197,7 @@ func (sh *sessionHandle) destroy() {
p.trackedSessionHandles.Remove(tracked)
p.mu.Unlock()
}
-s.destroy(false)
+s.destroy(false, true)
Contributor

Is destroy() only called for sessions that are in use? I would expect that it could also be called for a session that is in the list of idle sessions, and in that case it was not in use.

Contributor Author

sessionHandle is the wrapper that is created only for transactions (that is, for sessions that are about to be used), so we can assume that any session destroyed through a sessionHandle was in use.
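
The invariant described here can be sketched as follows. The types are simplified stand-ins written for this example, and `take` is an assumed name for the checkout path; the point is only that a handle is created at checkout and nowhere else, so a decrement in the handle's `destroy` is always matched by an earlier increment.

```go
package main

import "fmt"

// session, pool, and sessionHandle are simplified stand-ins for illustration.
type session struct{ id int }

type pool struct {
	idle     []*session
	numInUse int
}

// sessionHandle wraps a session that has been checked out of the pool.
type sessionHandle struct {
	p *pool
	s *session
}

// take checks out an idle session. In this sketch it is the only place where
// a sessionHandle is created.
func (p *pool) take() *sessionHandle {
	if len(p.idle) == 0 {
		return nil
	}
	s := p.idle[len(p.idle)-1]
	p.idle = p.idle[:len(p.idle)-1]
	p.numInUse++
	return &sessionHandle{p: p, s: s}
}

// destroy discards the wrapped session. Handles exist only for checked-out
// sessions, so the decrement here is balanced by the increment in take.
func (sh *sessionHandle) destroy() {
	if sh.s == nil {
		return // already destroyed; do not decrement twice
	}
	sh.s = nil
	sh.p.numInUse--
}

func main() {
	p := &pool{idle: []*session{{id: 1}, {id: 2}}}
	h := p.take()
	fmt.Println(p.numInUse) // 1
	h.destroy()
	h.destroy() // second call is a no-op
	fmt.Println(p.numInUse) // 0
}
```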

Contributor Author

@olavloite I added a code comment explaining this.

Contributor

Ah, that makes it clearer, thanks!

rahul2393 requested a review from olavloite on June 27, 2024 10:44
rahul2393 added the "automerge" label (Merge the pull request once unit tests and other checks pass) on Jun 27, 2024
rahul2393 enabled auto-merge (squash) on June 27, 2024 13:21
rahul2393 merged commit a1e198a into main on Jun 28, 2024
12 checks passed
rahul2393 deleted the fix_negative_metrics branch on June 28, 2024 07:14
gcf-merge-on-green bot removed the "automerge" label (Merge the pull request once unit tests and other checks pass) on Jun 28, 2024
rahul2393 added a commit that referenced this pull request Jul 16, 2024
 (#10508)

* fix(spanner): add debug log to print full stack trace when negative value happens

* skip decrementing num_in_use metric count when session is destroyed from healthchecks.